Greedy-based Value Representation for Optimal Coordination in Multi-agent Reinforcement Learning
Due to the representation limitation of the joint Q value function,
multi-agent reinforcement learning methods with linear value decomposition
(LVD) or monotonic value decomposition (MVD) suffer from relative
overgeneralization. As a result, they cannot ensure optimal consistency (i.e.,
the correspondence between individual greedy actions and the maximal true Q
value). In this paper, we derive the expression of the joint Q value function
of LVD and MVD. According to the expression, we draw a transition diagram,
where each self-transition node (STN) is a possible convergence. To ensure
optimal consistency, the optimal node is required to be the unique STN.
Therefore, we propose the greedy-based value representation (GVR), which turns
the optimal node into an STN via inferior target shaping and further eliminates
the non-optimal STNs via superior experience replay. In addition, GVR achieves
an adaptive trade-off between optimality and stability. Our method outperforms
state-of-the-art baselines in experiments on various benchmarks. Theoretical
proofs and empirical results on matrix games demonstrate that GVR ensures
optimal consistency under sufficient exploration.
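To make the notion of optimal consistency concrete, the following minimal sketch fits a linear value decomposition to a small cooperative matrix game and compares the individual greedy actions with the true optimum. The payoff table, the uniform-sampling assumption, and all function names are hypothetical and chosen for illustration; this is not the GVR method itself, only a picture of the failure it targets.

```python
# Hypothetical 3x3 cooperative matrix game (illustrative payoffs, not from the paper).
PAYOFF = [
    [8.0, -12.0, -12.0],
    [-12.0, 0.0, 0.0],
    [-12.0, 0.0, 0.0],
]


def linear_decomposition(payoff):
    """Least-squares fit of payoff[i][j] ~ q1[i] + q2[j], assuming uniform sampling.

    For a full table with equal weights, the fitted joint value is
    row mean + column mean - grand mean. The grand mean is split evenly
    between the two agents here; any split yields the same joint fit.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    grand = sum(sum(row) for row in payoff) / (n_rows * n_cols)
    q1 = [sum(row) / n_cols - grand / 2 for row in payoff]
    q2 = [
        sum(payoff[i][j] for i in range(n_rows)) / n_rows - grand / 2
        for j in range(n_cols)
    ]
    return q1, q2


def argmax(values):
    """Index of the first maximal entry."""
    return max(range(len(values)), key=lambda k: values[k])


def greedy_joint_action(payoff):
    """Joint action formed by each agent maximizing its own utility."""
    q1, q2 = linear_decomposition(payoff)
    return argmax(q1), argmax(q2)


def optimal_joint_action(payoff):
    """Joint action with the maximal true payoff."""
    cells = [(i, j) for i in range(len(payoff)) for j in range(len(payoff[0]))]
    return max(cells, key=lambda ij: payoff[ij[0]][ij[1]])


def is_optimally_consistent(payoff):
    """True when the individual greedy actions select the true optimum."""
    return greedy_joint_action(payoff) == optimal_joint_action(payoff)
```

In this toy game the large penalties around the best cell pull down the averaged utility of action 0 for both agents, so the greedy joint action lands on a zero-payoff cell rather than on the cell with payoff 8. This is the relative overgeneralization effect in its simplest form.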
Prioritized Planning for Target-Oriented Manipulation via Hierarchical Stacking Relationship Prediction
In scenarios involving the grasping of multiple targets, learning the
stacking relationships between objects is fundamental for robots to execute
grasps safely and efficiently. However, current methods do not subdivide
stacking relationships into hierarchical types. In scenes where objects are
mostly stacked in an orderly manner, they cannot make human-like, highly
efficient grasping decisions. This paper proposes a perception-planning
method to distinguish different stacking types between objects and generate
prioritized manipulation order decisions based on given target designations. We
utilize a Hierarchical Stacking Relationship Network (HSRN) to discriminate the
hierarchy of stacking and generate a refined Stacking Relationship Tree (SRT)
for relationship description. Considering that objects with high stacking
stability can be grasped together if necessary, we introduce a
decision-making planner based on the Partially Observable Markov Decision
Process (POMDP). The planner leverages observations to robustly generate the
decision chain that requires the fewest grasps, and it supports specifying
multiple targets simultaneously. To verify our work, we set the
scene to the dining table and augment the REGRAD dataset with a set of common
tableware models for network training. Experiments show that our method
effectively generates grasping decisions that conform to human requirements
and, while maintaining the success rate, improves the implementation
efficiency compared with existing methods.
Comment: 8 pages, 8 figures
ESMC: Entire Space Multi-Task Model for Post-Click Conversion Rate via Parameter Constraint
Large-scale online recommender systems are deployed all over the Internet and
handle two basic tasks: Click-Through Rate (CTR) and Post-Click Conversion
Rate (CVR) estimation. However, traditional CVR estimators suffer from
well-known Sample Selection Bias and Data Sparsity issues. Entire space models
were proposed to address the two issues via tracing the decision-making path of
"exposure_click_purchase". Further, some researchers observed that there are
purchase-related behaviors between click and purchase, which can better capture
the user's decision-making intention and improve the recommendation
performance. Thus, the decision-making path has been extended to
"exposure_click_in-shop action_purchase" and can be modeled with a conditional
probability approach. Nevertheless, we observe that the chain rule of
conditional probability does not always hold. We report the Probability Space
Confusion (PSC) issue and mathematically derive the difference between the
ground truth and the estimate. We propose a novel Entire Space Multi-Task Model
for Post-Click Conversion Rate via Parameter Constraint (ESMC) and two
alternatives: Entire Space Multi-Task Model with Siamese Network (ESMS) and
Entire Space Multi-Task Model in Global Domain (ESMG) to address the PSC issue.
Specifically, we handle "exposure_click_in-shop action" and "in-shop
action_purchase" separately in light of the characteristics of in-shop action.
The first path is still treated with conditional probability while the second
one is treated with a parameter constraint strategy. Experiments on both offline
and online environments in a large-scale recommendation system illustrate the
superiority of our proposed methods over state-of-the-art models. The
real-world datasets will be released.
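The conditional-probability treatment of the decision-making path described above can be illustrated with a small counting sketch. The exposure log, its counts, and the function names are all hypothetical. The sketch assumes a strictly nested funnel (every purchase follows an in-shop action, which follows a click) and non-empty click and action stages. It does not reproduce the PSC issue or the ESMC parameter constraint; it only shows the chain-rule factorization in the case where that factorization is exact.

```python
# Hypothetical exposure log: one record per exposure, with nested funnel flags.
# The counts are invented for illustration and are not from the paper.
EXPOSURES = (
    [{"click": True, "action": True, "purchase": True}] * 1
    + [{"click": True, "action": True, "purchase": False}] * 1
    + [{"click": True, "action": False, "purchase": False}] * 3
    + [{"click": False, "action": False, "purchase": False}] * 5
)


def stage_rates(exposures):
    """Empirical P(click), P(action | click), P(purchase | action).

    Assumes at least one click and at least one in-shop action,
    otherwise the divisions below are undefined.
    """
    clicked = [e for e in exposures if e["click"]]
    acted = [e for e in clicked if e["action"]]
    bought = [e for e in acted if e["purchase"]]
    p_click = len(clicked) / len(exposures)
    p_action_given_click = len(acted) / len(clicked)
    p_purchase_given_action = len(bought) / len(acted)
    return p_click, p_action_given_click, p_purchase_given_action


def chain_rule_estimate(exposures):
    """Purchase rate over all exposures, as a product of the stage rates."""
    p_click, p_action, p_purchase = stage_rates(exposures)
    return p_click * p_action * p_purchase


def direct_purchase_rate(exposures):
    """Purchase rate counted directly over all exposures."""
    return sum(e["purchase"] for e in exposures) / len(exposures)
```

On this nested toy log the product of the three stage rates equals the directly counted purchase rate. When the stages are not nested in this way, the two quantities can differ.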